There are 6 examples not solved by any model.
If these are sound problems, solving some of them is a good signal that your model is indeed better than the leading models.
HumanEval/129, HumanEval/130, HumanEval/132, HumanEval/145, HumanEval/163, HumanEval/32
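
As a rough illustration of how such a list could be derived, the sketch below computes per-problem accuracy from a pass/fail matrix and keeps the problems with zero accuracy. The `results` layout, the model names, and the function names are hypothetical; this is not the code that produced the report.

```python
def per_problem_accuracy(results):
    """Map each problem id to the fraction of models that passed it.

    `results` is assumed to look like {model_name: {problem_id: bool}}.
    A problem missing from a model's outcomes counts as a failure.
    """
    problems = sorted({p for outcomes in results.values() for p in outcomes})
    n_models = len(results)
    return {
        p: sum(bool(outcomes.get(p, False)) for outcomes in results.values()) / n_models
        for p in problems
    }


def unsolved_problems(results):
    """Return the problems that no model solved (accuracy exactly 0)."""
    return [p for p, acc in per_problem_accuracy(results).items() if acc == 0.0]


# Toy data with made-up model names and problem ids.
results = {
    "model_a": {"HumanEval/1": True, "HumanEval/2": False, "HumanEval/3": False},
    "model_b": {"HumanEval/1": True, "HumanEval/2": True, "HumanEval/3": False},
}
print(unsolved_problems(results))
```

On the toy data only `HumanEval/3` is reported, since neither model passes it.
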
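
The table below lists, for each problem, the lowest-rated model that solved it (`model`) and that model's Elo rating (`min_elo`).

| example_link | model | min_elo |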
|---|---|---|
| HumanEval/93 | xwincoder-34b | 1194.236 |
| HumanEval/108 | claude-3-sonnet-20240229 | 1100.361 |
| HumanEval/137 | Qwen--Qwen1.5-72B-Chat | 1074.631 |
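
A minimal sketch of how a minimum-Elo-to-solve column could be computed, assuming one Elo rating per model and a pass/fail record per model and problem. The data layout, names, and ratings below are hypothetical and only illustrate the idea.

```python
def min_elo_to_solve(results, elo):
    """For each solved problem, find the lowest-rated model that solved it.

    `results` is assumed to look like {model_name: {problem_id: bool}} and
    `elo` like {model_name: float}. Returns {problem_id: (model_name, elo)}.
    Problems that no model solved do not appear in the output.
    """
    best = {}
    for model, outcomes in results.items():
        for problem, passed in outcomes.items():
            if not passed:
                continue
            # Keep the solver with the smallest Elo seen so far.
            if problem not in best or elo[model] < best[problem][1]:
                best[problem] = (model, elo[model])
    return best


# Toy data: ratings and outcomes are invented for illustration.
elo = {"weak": 900.0, "mid": 1050.0, "strong": 1200.0}
results = {
    "weak": {"p1": True, "p2": False, "p3": False},
    "mid": {"p1": True, "p2": True, "p3": False},
    "strong": {"p1": True, "p2": True, "p3": False},
}
for problem, (model, rating) in sorted(min_elo_to_solve(results, elo).items()):
    print(problem, model, rating)
```

A high minimum Elo means that only strong models solve the problem, which is the kind of problem the table above highlights.
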

These are the 10 problems with the lowest correlation (`tau`) with the overall evaluation, i.e., the problems on which better models gain the least; where `tau` is negative, better models tend to do worse.

| example_link | acc | tau |
|---|---|---|
| HumanEval/54 | 0.163 | -0.123 |
| HumanEval/55 | 0.878 | -0.119 |
| HumanEval/126 | 0.061 | -0.055 |
| HumanEval/137 | 0.020 | 0.025 |
| HumanEval/47 | 0.939 | 0.035 |
| HumanEval/108 | 0.020 | 0.042 |
| HumanEval/97 | 0.796 | 0.045 |
| HumanEval/122 | 0.673 | 0.047 |
| HumanEval/116 | 0.776 | 0.062 |
| HumanEval/11 | 0.857 | 0.082 |
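
The report does not say how `tau` is computed. One plausible reading is Kendall's tau between a model's overall score and its pass/fail outcome on a single problem. The sketch below implements the tau-b variant in plain Python under that assumption; the function name and the toy scores are hypothetical.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences.

    Pairs tied in both sequences are ignored. Pairs tied in only one
    sequence enter the denominator, as in the tau-b definition.
    Returns 0.0 when tau is undefined, e.g. when one sequence is
    constant (a convention chosen for this sketch).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: four models ordered by an overall score (values are made up).
overall_score = [1, 2, 3, 4]
solved_by_strong = [0, 0, 1, 1]  # only the stronger models pass
solved_by_weak = [1, 1, 0, 0]    # only the weaker models pass

print(kendall_tau_b(overall_score, solved_by_strong))
print(kendall_tau_b(overall_score, solved_by_weak))
```

The first call gives a positive value and the second its negative, matching the interpretation that a negative `tau` marks problems on which better models do worse.
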

Histogram of problems by the accuracy on each problem.

Histogram of problems by the minimum Elo to solve each problem.